Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove TimeDelta serialization_type parameter; deserialize from float; improve serialization perf and accuracy #2654

Merged
merged 7 commits into from
Dec 30, 2024

Conversation

ddelange
Copy link
Contributor

@ddelange ddelange commented Dec 10, 2024

This PR allows propagating floats/strings with decimals to TimeDelta.deserialize without silently downcasting the input to int. Currently, serialization_type is not only being used for serialization, but also being used for deserialization, which adds no added value: a float cast can always safely be used for deserialization and is backwards compatible (but fixes this bug).

originally reported here: sloria/environs#372

MRE:

>>> from marshmallow import fields
>>> field = fields.TimeDelta()
>>> field.deserialize(12.9)
datetime.timedelta(seconds=12)

After this PR:

>>> from marshmallow import fields
>>> field = fields.TimeDelta()
>>> field.deserialize(12.9)
datetime.timedelta(seconds=12, microseconds=900000)

I added one test for string input, and changed the existing test for float input.

There are already math.isclose tests when passing floats to microseconds (which will be rounded by the datetime.timedelta constructor).

Copy link
Member

@sloria sloria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR ❤️ This seems like the right behavior.

Lgtm but I'll give @lafrech some time to take a look in case I'm not thinking of some case where end users might rely on the current behavior.

@lafrech
Copy link
Member

lafrech commented Dec 10, 2024

[I enabled issues again. The spam streak is over on flask-smorest. Let's hope it is here too.]

I don't have a full understanding of the issue but in case it went unnoticed, here are related issues/PRs:

Also, the docstring should be updated if we merge.

@ddelange
Copy link
Contributor Author

Hi @lafrech 👋

You're right, the docstring needed some work. Please see the last commit. I hope that also clarifies the bug that this PR fixes: needlessly using serialization_type during deserialization.

@ddelange ddelange force-pushed the timedelta-deserialize branch from ea0509a to 0755fe1 Compare December 10, 2024 22:51
@sloria
Copy link
Member

sloria commented Dec 11, 2024

Docs update looks good to me. Thanks @ddelange .

Thanks for digging up the original issue and PRs @lafrech. Looking back at those, I think allowing truncation to an integer was an oversight on my part rather than an explicit decision. Even the ISO8601 spec mentioned in #105 (comment) allows for fractional components:

The smallest value used may also have a decimal fraction,[39] as in "P0.5Y" to indicate half a year. This decimal fraction may be specified with either a comma or a full stop, as in "P0,5Y" or "P0.5Y".

@lafrech @deckar01 Do we consider this a breaking change? Even though this is a much smaller change than previous major version bumps, I'm not opposed to a major version bump with some explanatory text in the changelog that says that the change is very limited in scope and only affects users of the TimeDelta field.

@ddelange
Copy link
Contributor Author

fwiw, this change only affects (helps) users that were serializing using serialization_type=float, but then not taking care when deserializing.

I would say this does not warrant a major version bump, but rather a bugfix release.

@sloria
Copy link
Member

sloria commented Dec 11, 2024

It would also affect users who didn't explicitly pass serialization_type or passed serialization_type=int, right (as your MRE shows)? Users who were relying on floats getting truncated would experience a breakage (not saying this is likely or desirable, but it's possible).

@ddelange
Copy link
Contributor Author

right. the good ol' rely on a bug motto. programming sucks...

@sloria
Copy link
Member

sloria commented Dec 11, 2024

lol, yeah... i guess i'm not 100% sure it's a bug, given that there was an explicit test for the current behavior. i'm like 80% sure it was oversight, but it was long ago so want to give a bit of time for others to chime in before shipping the change

Copy link
Member

@lafrech lafrech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure.

One of our concerns when designing a field is how values round-trip.

Current behaviour lets the user decide whether the serialized version should be int or float, and considers it is being given serialized values accordingly. If values are serialized as int, then the type that is expected as input when deserializing is int.

The original name of the parameter was serde_type, not very nice but showing it works both ways.

Before this PR:

    # TimeDelta(serialization_type=int)
    12 -> timedelta(seconds=12) -> 12
    12.42 -> timedelta(seconds=12) -> 12

    # TimeDelta(serialization_type=float)
    12 -> timedelta(seconds=12) -> 12.0
    12.42 -> timedelta(seconds=12, microseconds=420000) -> 12.42

After this PR:

    # TimeDelta(serialization_type=int)
    12 -> timedelta(seconds=12) -> 12
    12.42 -> timedelta(seconds=12, microseconds=420000) -> 12

    # TimeDelta(serialization_type=float)
    12 -> timedelta(seconds=12) -> 12.0
    12.42 -> timedelta(seconds=12, microseconds=420000) -> 12.42

From this perspective, I'm not sure the latter is less surprising behaviour.

The PR makes the field a bit more permissive on inputs, accepting floats when expecting integers. This is not in line with Integer field which truncates floats.

If floats are to be expected, then serialization_type=float should be used.

If the issue is that TimeDelta defaults to int then maybe the default could be changed in next major version. Until then, nothing prevents the user to specify float, right?

I might be totally mistaken, in which case please ignore.

src/marshmallow/fields.py Show resolved Hide resolved
Allow (de)serialization to `float` through use of a new `serialization_type` parameter.
`int` is the default to retain previous behaviour.
Allow serialization to `float` through use of a new `serialization_type` parameter.
Defaults to `int` for backwards compatibility.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a new versionchanged line here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never truncate a `float` to `int` during deserialization.

too concise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe

Never truncate a `float` to an `int` before deserializing it into a :class:`datetime.timedelta`.

@ddelange
Copy link
Contributor Author

Thanks for your elaborate review!

From this perspective, I'm not sure the latter is less surprising behaviour.

To me the latter definitely looks 'right'. From another perspective: that float came from marshmallow (starting off with a timedelta). It got serialized and sent elsewhere for deserialization. When the receiving party deserializes, in the latter example he doesn't need prior knowledge about the sender's settings. Deserialization has now become robust.

@lafrech
Copy link
Member

lafrech commented Dec 12, 2024

Is there anything wrong with TimeDelta(serialization_type=float), apart from the fact that it is not the default?

I mean does the change in this PR allow a use case that can't be achieved with current parameters? Or is the only benefit to make it the default choice?

@ddelange
Copy link
Contributor Author

re: anything wrong - my argument here is that serialization_type should be used only for serialization and not for deserialization. but that's about it :)

@lafrech
Copy link
Member

lafrech commented Dec 12, 2024

So would it be fine if the parameter was named serialization_deserialization_type, de_serialization_type or serialized_type of whatever we can come up with that makes it explicit it works both ways?

If it is just that, we could use a better naming and deprecate the old one.

@ddelange
Copy link
Contributor Author

from my side there's no need for a breaking change here, this change just seemed like a nice robustness improvement to me (not considering this breaking).

it only came up because a user was surprised that his float was being silently truncated when feeding into @sloria's environs package. since that package only ever deserializes (values from unknown sources), this seemed like the right way forward. but feel free to close / take over if you do consider this a 'bad' change, no hard feelings :)

will look uglier than now, but environs could fix this without any PR om marshmallow by subclassing TimeDelta.__init__

@sloria
Copy link
Member

sloria commented Dec 18, 2024

I generally agree with @ddelange's comment in #2654 (comment), and the proposed change certainly makes sense for the use case in environs. On one hand, we might as well be robust on input. Boolean will accept various strings and integers while outputting to a bool. On the other hand, marshmallow doesn't always attempt to be lenient on inputs. For example, Date and DateTime assume that the input and output formats are the same.

@lafrech @deckar01 Any strong opinions?

@sloria
Copy link
Member

sloria commented Dec 18, 2024

Another possible path would be to

  1. Add a deserialization_type to this PR and make it default to int (backwards-compatible)
  2. Add a warning if a deserializing a float

something like

diff --git a/src/marshmallow/fields.py b/src/marshmallow/fields.py
index b564aee..19d82f6 100644
--- a/src/marshmallow/fields.py
+++ b/src/marshmallow/fields.py
@@ -1487,6 +1487,8 @@ def __init__(
         self,
         precision: str = SECONDS,
         serialization_type: type[int | float] = int,
+        *,
+        deserialization_type: type[int | float] = int,
         **kwargs,
     ):
         precision = precision.lower()
@@ -1508,9 +1510,12 @@ def __init__(
 
         if serialization_type not in (int, float):
             raise ValueError("The serialization type must be one of int or float")
+        if deserialization_type not in (int, float):
+            raise ValueError("The deserialization type must be one of int or float")
 
         self.precision = precision
         self.serialization_type = serialization_type
+        self.deserialization_type = deserialization_type
         super().__init__(**kwargs)
 
     def _serialize(self, value, attr, obj, **kwargs):
@@ -1527,8 +1532,15 @@ def _serialize(self, value, attr, obj, **kwargs):
         return value.total_seconds() / base_unit.total_seconds()
 
     def _deserialize(self, value, attr, data, **kwargs):
+        if isinstance(value, float) and self.deserialization_type is not float:
+            warnings.warn(
+                f"float value for attribute {attr!r} will truncated to an integer value in {self.precision}. "
+                + "It will be deserialized as a float in marshmallow 3.25.0.",
+                UserWarning,
+                stacklevel=2,
+            )
         try:
-            value = float(value)
+            value = self.deserialization_type(value)
         except (TypeError, ValueError) as error:
             raise self.make_error("invalid") from error
  1. Release this in 3.24.0.
  2. Change the default deserialization_type to float and remove the warning.
  3. Release 3.25.0

this warns potentially-affected users of the change. but maybe it's not worth the added API surface? ¯\(ツ)

@deckar01
Copy link
Member

total_seconds() returns a float, so that should be the default. I don’t think it needs an extra arg (deserialize from == serialize to). Truncation was documented and tested, so I think it needs a major release, even though the use case is questionable.

We might consider deprecating the type arg while we are at it. The truncation can be enveloped or a custom field and does not seem to warrant a convenience API.

@ddelange
Copy link
Contributor Author

do you suggest to deprecate serialization_type altogether (major release) and make float the default? sounds good to me

@lafrech
Copy link
Member

lafrech commented Dec 21, 2024 via email

@sloria
Copy link
Member

sloria commented Dec 21, 2024

@lafrech if i'm understanding your suggestion correctly:

  1. change this PR to remove the serialization_type parameter and mark this for 4.0
  2. optionally merge other small-but-breaking changes

in environs, serialization_type can be defaulted to float for marshmallow 3 compatibility.

if this is the proposal, i'm good with it

@lafrech
Copy link
Member

lafrech commented Dec 21, 2024 via email

@ddelange ddelange force-pushed the timedelta-deserialize branch 2 times, most recently from fd171e8 to 210339f Compare December 22, 2024 16:50
Copy link
Contributor Author

@ddelange ddelange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! now targeting 4.0: I've demolished serialization_type (with deprecation warning), updated tests and added to docstring:

.. versionchanged:: 4.0.0
    Deprecate `serialization_type` parameter, always serialize to float.

src/marshmallow/fields.py Outdated Show resolved Hide resolved
@ddelange ddelange changed the title Avoid silent integer downcast in TimeDelta.deserialize Deprecate TimeDelta serialization_type parameter Dec 22, 2024
@ddelange ddelange force-pushed the timedelta-deserialize branch 8 times, most recently from 0742e3f to 8d166b1 Compare December 23, 2024 15:23
@ddelange ddelange force-pushed the timedelta-deserialize branch from 8d166b1 to 3470680 Compare December 23, 2024 15:24
Comment on lines +1499 to +1501
if precision not in self._unit_to_microseconds_mapping:
units = ", ".join(self._unit_to_microseconds_mapping)
msg = f"The precision must be one of: {units}."
Copy link
Contributor Author

@ddelange ddelange Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unified this msg formatting with the Enum field ref.

Comment on lines +1518 to +1521
# limit float arithmetics to a single division to minimize precision loss
microseconds: int = utils.timedelta_to_microseconds(value)
microseconds_per_unit: int = self._unit_to_microseconds_mapping[self.precision]
return microseconds / microseconds_per_unit
Copy link
Contributor Author

@ddelange ddelange Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from float precision issues with the old total_seconds approach, the new approach is 5x faster:

In [1]: from marshmallow import utils
   ...: from datetime import timedelta
   ...:
   ...: WEEKS = "weeks"
   ...: DAYS = "days"
   ...: HOURS = "hours"
   ...: MINUTES = "minutes"
   ...: SECONDS = "seconds"
   ...: MILLISECONDS = "milliseconds"
   ...: MICROSECONDS = "microseconds"
   ...:
   ...: _unit_to_microseconds_mapping = {
   ...:     WEEKS: 1000000 * 60 * 60 * 24 * 7,
   ...:     DAYS: 1000000 * 60 * 60 * 24,
   ...:     HOURS: 1000000 * 60 * 60,
   ...:     MINUTES: 1000000 * 60,
   ...:     SECONDS: 1000000,
   ...:     MILLISECONDS: 1000,
   ...:     MICROSECONDS: 1,
   ...: }
   ...: 
   ...: precision = WEEKS
   ...:
   ...: def serialize_old(value):
   ...:     base_unit = timedelta(**{precision: 1})
   ...:     return value.total_seconds() / base_unit.total_seconds()
   ...:
   ...: def serialize_new(value):
   ...:     microseconds: int = utils.timedelta_to_microseconds(value)
   ...:     microseconds_per_unit: int = _unit_to_microseconds_mapping[precision]
   ...:     return microseconds / microseconds_per_unit
   ...:
   ...: value = timedelta(weeks=1, microseconds=1)

In [2]: %timeit serialize_old(value)
558 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [3]: %timeit serialize_new(value)
113 ns ± 0.583 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@sloria sloria added this to the 4.0 milestone Dec 27, 2024
src/marshmallow/fields.py Outdated Show resolved Hide resolved
@sloria sloria changed the base branch from dev to 4.0 December 30, 2024 05:13
@ddelange ddelange requested review from lafrech and sloria December 30, 2024 11:40
@sloria sloria changed the title Deprecate TimeDelta serialization_type parameter Remove TimeDelta serialization_type parameter Dec 30, 2024
@sloria sloria changed the title Remove TimeDelta serialization_type parameter Remove TimeDelta serialization_type parameter; deserialize from float; improve serialization perf and accuracy Dec 30, 2024
@sloria sloria merged commit 0c90110 into marshmallow-code:4.0 Dec 30, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants